Generalized Linear Models

For Over-Dispersed Data

Nathen Byford

Basics of Generalized Linear Models (GLMs)

  • GLMs are a flexible generalization of ordinary linear regression that allows the response variable to follow an error distribution other than the normal.
  • Components of GLMs:
    • Random Component: Specifies the distribution of the response variable (e.g., Normal, Binomial, Poisson).
    • Systematic Component: A linear predictor, a combination of explanatory variables (predictors).
    • Link Function: Connects the mean of the response variable to the linear predictor (e.g., identity, log, logit).
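
As a rough illustration of how the three components fit together, the Python sketch below computes the mean of a response from a linear predictor and an inverse link. The names `INVERSE_LINKS`, `linear_predictor`, and `glm_mean`, as well as the coefficient values, are hypothetical and chosen only for this example.

```python
import math

# Inverse link functions map the linear predictor eta back to the mean mu.
INVERSE_LINKS = {
    "identity": lambda eta: eta,                        # e.g. Normal response
    "log": math.exp,                                    # e.g. Poisson response
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # e.g. Binomial response
}

def linear_predictor(beta, x):
    """Systematic component: eta = sum_j beta_j * x_j."""
    return sum(b * xj for b, xj in zip(beta, x))

def glm_mean(beta, x, link="log"):
    """Mean of the response implied by the link: mu = g^{-1}(eta)."""
    return INVERSE_LINKS[link](linear_predictor(beta, x))

# Hypothetical coefficients (intercept, slope) and one observation with x = 2.
beta = [0.5, 0.25]
x = [1.0, 2.0]
mu = glm_mean(beta, x, link="log")  # exp(0.5 + 0.25 * 2) = exp(1.0)
```

The random component would then enter through the likelihood evaluated at `mu`, for example a Poisson likelihood when the log link is used for counts.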

Over-Dispersed Data

  • In general, over-dispersion means the observed variance is greater than the variance implied by the assumed data model.
  • For count data, where the Poisson model is the usual reference, over-dispersion means the variance is greater than the mean.
    • Similarly, equi-dispersion means the variance equals the mean.
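
A quick way to see these definitions at work is to compare the sample variance with the sample mean. The sketch below is only illustrative: the function name `dispersion_index` and the counts are made up for the example.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean (requires a positive mean)."""
    m = mean(counts)
    if m <= 0:
        raise ValueError("dispersion index requires a positive mean")
    return variance(counts) / m

# Hypothetical counts: mostly zeros with one large value.
clustered = [0, 0, 0, 8]
ratio = dispersion_index(clustered)  # mean 2, sample variance 16
```

A ratio near 1 is consistent with equi-dispersion, while a ratio well above 1 points to over-dispersion.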

Poisson regression

Poisson regression is the most popular method for modeling count data. The Poisson distribution assumes equi-dispersion, an assumption that real count data often violate.
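
One common check of the equi-dispersion assumption, which is not taken from these slides, is the Pearson dispersion statistic: the Pearson chi-square of the fitted Poisson model divided by its residual degrees of freedom. The sketch below assumes the fitted means are already available; the function name `pearson_dispersion` and the data are hypothetical.

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square divided by residual degrees of freedom.

    Under a Poisson model Var(Y) = mu, so each term is (y - mu)^2 / mu.
    Values well above 1 suggest over-dispersion.
    """
    dof = len(y) - n_params
    if dof <= 0:
        raise ValueError("need more observations than parameters")
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / dof

# Hypothetical observed counts and fitted means from an intercept-only model
# (the fitted mean is the sample mean, 3).
y = [0, 1, 9, 2]
mu = [3.0, 3.0, 3.0, 3.0]
phi = pearson_dispersion(y, mu, n_params=1)
```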

Common applications

  • Count data in Biology
  • Epidemiology
  • Finance
  • Insurance claims
  • Environmental studies
  • etc.

Almost any real-world count data set can exhibit over-dispersion.

Causes

  • Increased variability of counts
  • Event clustering
  • Excess zeros
  • Interaction effects
  • Measurement error
  • Environmental effects

Candidate Distributions

  1. Negative-Binomial
  2. Generalized Poisson
  3. Double Poisson
  4. Conway-Maxwell-Poisson (CMP)
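  • Zero-inflated distributions
    • ZIP (zero-inflated Poisson)
    • ZINB (zero-inflated negative binomial)
    • ZIDP/ZIGP (zero-inflated double Poisson / generalized Poisson)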

Negative-Binomial

  • Parameters: mean: \(\mu\), dispersion: \(k\)
  • Variance: \(\mu + \mu^2/k\)
    • Function of mean and dispersion parameter
    • Since \(\mu^2/k > 0\), the variance always exceeds the mean, so the model captures over-dispersion.
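
The variance formula can be checked numerically. The sketch below assumes the usual mean/dispersion form of the negative binomial pmf, \(\frac{\Gamma(y+k)}{\Gamma(k)\,y!}\left(\frac{k}{k+\mu}\right)^k\left(\frac{\mu}{k+\mu}\right)^y\), which is not written out on the slide; the function names and parameter values are illustrative.

```python
import math

def nb_pmf(y, mu, k):
    """Negative binomial pmf in the mean (mu) / dispersion (k) form."""
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (k + mu))
             + y * math.log(mu / (k + mu)))
    return math.exp(log_p)

def nb_moments(mu, k, upper=2000):
    """Mean and variance from a truncated sum over y = 0, ..., upper - 1."""
    probs = [nb_pmf(y, mu, k) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

# Hypothetical parameter values.
mu, k = 3.0, 2.0
mean, var = nb_moments(mu, k)
expected_var = mu + mu ** 2 / k  # 3 + 9 / 2 = 7.5
```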

Generalized Poisson

  • Parameters: \(\lambda\), \(\theta\)
  • Mean: \(\lambda / (1- \theta)\)
  • Variance: \(\lambda / (1- \theta)^3\)

\[ f(Y = y) = \frac{\lambda(\lambda + \theta y)^{y-1} e^{-(\lambda + \theta y)}}{y!}, \quad y = 0, 1, 2, \ldots, \quad \lambda > 0,\space |\theta| < 1 \]

  • Designed to extend to overdispersed and underdispersed count data
    • \(\theta = 0\): reduces to Poisson
    • \(\theta > 0\): models overdispersion
    • \(\theta < 0\): models underdispersion
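
The mean and variance formulas for the generalized Poisson can be checked against the pmf by direct summation. This is a minimal sketch that only covers \(0 \le \theta < 1\), where no truncation of the support is needed; the function names and parameter values are hypothetical.

```python
import math

def gp_pmf(y, lam, theta):
    """Generalized Poisson pmf, evaluated on the log scale for stability."""
    if not (lam > 0 and 0 <= theta < 1):
        raise ValueError("this sketch only covers lam > 0 and 0 <= theta < 1")
    rate = lam + theta * y
    log_p = (math.log(lam) + (y - 1) * math.log(rate)
             - rate - math.lgamma(y + 1))
    return math.exp(log_p)

def gp_moments(lam, theta, upper=2000):
    """Mean and variance from a truncated sum over y = 0, ..., upper - 1."""
    probs = [gp_pmf(y, lam, theta) for y in range(upper)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

# Hypothetical parameter values with theta > 0 (over-dispersion).
lam, theta = 2.0, 0.3
mean, var = gp_moments(lam, theta)
# Compare with lam / (1 - theta) and lam / (1 - theta) ** 3.
```

Setting `theta = 0.0` in `gp_pmf` gives the ordinary Poisson probabilities, matching the first case in the list above.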

Double Poisson

  • Parameters: \(\mu\), \(\theta\)
  • Variance: approximately \(\mu / \theta\)
  • Member of the double exponential family (Efron 1986), defined up to a normalizing constant by the pmf \[ f(Y = y) = (\theta^{1/2}e^{-\theta\mu}) \left(\frac{e^{-y}y^y}{y!}\right) \left(\frac{e \mu}{y}\right)^{\theta y} \]
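
To get a feel for the double Poisson, the density can be normalized numerically and its moments computed by summation. The sketch below uses the form from Efron (1986), with the middle factor \(e^{-y}y^y/y!\), and treats \(0 \log 0\) as 0; the function names and parameter values are illustrative only.

```python
import math

def dp_unnormalized(y, mu, theta):
    """Double Poisson density without its normalizing constant."""
    y_log_y = y * math.log(y) if y > 0 else 0.0  # convention: 0 * log(0) = 0
    log_f = (0.5 * math.log(theta) - theta * mu
             - y + y_log_y - math.lgamma(y + 1)
             + theta * (y + y * math.log(mu) - y_log_y))
    return math.exp(log_f)

def dp_moments(mu, theta, upper=2000):
    """Normalizing sum, mean and variance from a truncated sum."""
    weights = [dp_unnormalized(y, mu, theta) for y in range(upper)]
    total = sum(weights)
    mean = sum(y * w for y, w in enumerate(weights)) / total
    var = sum((y - mean) ** 2 * w for y, w in enumerate(weights)) / total
    return total, mean, var

# theta = 1 recovers the Poisson exactly; theta < 1 inflates the variance.
total_pois, mean_pois, var_pois = dp_moments(mu=4.0, theta=1.0)
total_od, mean_od, var_od = dp_moments(mu=10.0, theta=0.5)
```

With `theta = 0.5` the computed variance should come out near \(\mu/\theta\), though only approximately, since the mean and variance expressions for this family are themselves approximations.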

Conway-Maxwell-Poisson (CMP)

  • Parameters: \(\lambda\), \(\nu\)
  • Weighted Poisson distribution with pmf: \[ f(Y = y) = \frac{\lambda^y}{(y!)^\nu Z(\lambda, \nu)}, \quad Z(\lambda, \nu) = \sum_{j=0}^\infty \frac{\lambda^j}{(j!)^\nu} \]
  • Includes as special cases (Sellers et al. 2012) the Poisson\((\lambda)\) when \(\nu = 1\) and the geometric, with success probability \(1 - \lambda\), when \(\nu = 0\) and \(\lambda < 1\)
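
Because \(Z(\lambda, \nu)\) has no closed form in general, a simple way to evaluate the CMP pmf is to truncate the infinite sum. The sketch below does this with a fixed number of terms, which is an assumption of the example and not a recommendation: it is adequate for moderate \(\lambda\) but not when \(\nu\) is near 0 and \(\lambda\) is close to or above 1. The function name `cmp_pmf` is hypothetical.

```python
import math

def cmp_pmf(y, lam, nu, terms=1000):
    """CMP pmf with Z(lam, nu) approximated by its first `terms` terms."""
    def log_weight(j):
        # log of lam^j / (j!)^nu
        return j * math.log(lam) - nu * math.lgamma(j + 1)

    z = sum(math.exp(log_weight(j)) for j in range(terms))
    return math.exp(log_weight(y)) / z

# nu = 1 gives the Poisson(lam) probabilities.
p_poisson = cmp_pmf(2, lam=3.0, nu=1.0)
# nu = 0 with lam < 1 gives a geometric distribution: P(Y = y) = (1 - lam) * lam^y.
p_geometric = cmp_pmf(3, lam=0.5, nu=0.0)
```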

Model comparisons

Model    Poisson    NB         GP
DIC      1,291.8    1,273.9    1,265.6

  • A Bayesian study compared Poisson, Negative Binomial, and CMP models for longitudinal counts using DIC (Alam et al. 2023).

Model    Poisson     NB          CMP
DIC      1,362.39    1,350.67    1,348.87

Model    CMP      Poisson    Neg-Bin
AIC      5,073    5,589      5,077
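
For both DIC and AIC, lower values indicate the preferred model. The short sketch below ranks the models in each comparison and reports each model's gap to the best one, using the values from the three comparisons above. The helper `rank_by_criterion` and the dictionary names are hypothetical.

```python
def rank_by_criterion(scores):
    """Return (model, score, gap to best) tuples sorted from best to worst."""
    best = min(scores.values())
    ordered = sorted(scores.items(), key=lambda item: item[1])
    return [(model, score, round(score - best, 2)) for model, score in ordered]

# Values copied from the three comparisons on this slide.
dic_first = {"Poisson": 1291.8, "NB": 1273.9, "GP": 1265.6}
dic_second = {"Poisson": 1362.39, "NB": 1350.67, "CMP": 1348.87}
aic_third = {"CMP": 5073, "Poisson": 5589, "Neg-Bin": 5077}

for name, scores in [("DIC (first)", dic_first),
                     ("DIC (second)", dic_second),
                     ("AIC (third)", aic_third)]:
    print(name, rank_by_criterion(scores))
```

In all three comparisons the Poisson model ranks last, and the gap between the two over-dispersion models is much smaller than their gap to the Poisson.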

Results

  • Modeling over-dispersed data with an improper distribution has been shown to lead to biased results.
  • To prevent biased results from over-dispersed data, models such as the CMP, GP, or DP can prove beneficial.
  • Overall, the CMP offers the most flexibility for modeling both over- and under-dispersion, because the GP model must be truncated for some values of its dispersion parameter.

References

Alam, M., Gwon, Y., and Meza, J. (2023), “Bayesian Conway-Maxwell-Poisson (CMP) regression for longitudinal count data,” Communications for Statistical Applications and Methods, 30, 291–309. https://doi.org/10.29220/CSAM.2023.30.3.291.
Efron, B. (1986), “Double exponential families and their use in generalized linear regression,” Journal of the American Statistical Association, 81, 709–721. https://doi.org/10.2307/2289002.
Gschlößl, S., and Czado, C. (2008), “Modelling count data with overdispersion and spatial effects,” Statistical Papers, 49, 531–552. https://doi.org/10.1007/s00362-006-0031-6.
Sellers, K. F., Borle, S., and Shmueli, G. (2012), “The COM-Poisson model for count data: A survey of methods and applications,” Applied Stochastic Models in Business and Industry, 28, 104–116. https://doi.org/10.1002/asmb.918.
Sellers, K. F., and Shmueli, G. (2010), “A flexible regression model for count data,” The Annals of Applied Statistics, 4, 943–961.